Method and device of extracting sound source acoustic image body in 3D space

ABSTRACT

The invention provides a method and device of extracting a sound source acoustic image body in 3D space. The method includes: determining a spatial position (ρ, μ, η) of a sound source acoustic image; determining the speakers beside the spatial position where the sound source acoustic image is located according to the determined spatial position; calculating correlations of the signals of all sound tracks of the selected speakers in the horizontal direction and the vertical direction; and obtaining and storing a parameter set {IC_(H), IC_(v), Min{IC_(H), IC_(v)}} of the acoustic image body, wherein Min{IC_(H), IC_(v)} is the smaller of IC_(H) and IC_(v). The parameters expressing the acoustic image body obtained in the present invention provide technical support for accurately restoring the size of the sound source acoustic image in a 3D audio live system, which solves the technical problem that the acoustic image restored in 3D audio is at present excessively narrow.

TECHNICAL FIELD

The present invention belongs to the field of acoustics and, in particular, relates to a method and device of extracting a sound source acoustic image body in 3D space.

BACKGROUND

At the end of 2009, the 3D movie “Avatar” topped the box office in more than 30 countries, and by early September 2010 its cumulative worldwide box office had exceeded 2.7 billion US dollars. “Avatar” achieved this performance because it used new 3D production technologies to deliver a striking sensory experience. Its graphics and realistic sound not only impressed audiences but also led the industry to assert that “movies have entered the 3D era”. It also spawned many related video, recording and playback technologies and standards. At the International Consumer Electronics Show in Las Vegas in January 2010, major TV makers showed new televisions that raised new expectations, and 3D became a new focus of competition among the major global TV manufacturers. To achieve a better viewing experience, a 3D sound field synchronized with the 3D video content is needed in order to achieve a truly immersive audio-visual experience. Early 3D audio systems (for example, the Ambisonics system) have complex structures and place high requirements on capture and playback devices, and are therefore difficult to popularize. In recent years, NHK in Japan launched a 22.2-channel system, which can reproduce the original 3D sound field through 24 speakers. In 2011, MPEG began developing an international standard for 3D audio, hoping to restore the 3D sound field with fewer speakers or with headphones at a given coding efficiency, so as to bring the technology to ordinary households. This shows that 3D audio and video technology has become a research focus of multimedia technology and an important direction for further development.

However, conventional 3D audio focuses only on restoring the spatial location of the sound source or the physical sound field, and not on restoring the size of the acoustic image of the sound source, in particular the acoustic image body. To achieve a better sound effect, the size of the acoustic image body needs to be restored accurately. To facilitate encoding, decoding and other system processing, parameters representing the sound source acoustic image body also need to be found, so that the original audio and video can be restored perfectly even after processing by the 3D audio system.

SUMMARY

The present invention addresses the deficiencies in the prior art, and proposes a method and device of extracting a sound source acoustic image body in 3D space.

The present invention provides a technical solution of a method of extracting a sound source acoustic image body in 3D space, the method comprising:

Step 1, determining a spatial position of a sound source acoustic image, which is achieved by:

-   -   performing time-frequency conversion on the signal of each channel and applying the same sub-band division to each channel; and, with the listener as the origin of a spherical coordinate system, for a speaker with horizontal angle μ_(i) and elevation angle η_(i), setting a vector p_(i)(k, n) representing the time-frequency representation of the corresponding signal,

${p_{i}\left( {k,n} \right)} = {{g_{i}\left( {k,n} \right)} \cdot \begin{bmatrix} {\cos \; {\mu_{i} \cdot \cos}\; \eta_{i}} \\ {\sin \; {\mu_{i} \cdot \cos}\; \eta_{i}} \\ {\sin \; \eta_{i}} \end{bmatrix}}$

-   -   wherein i refers to the index of the speaker, k refers to the frequency band index, n refers to the time-domain frame index, and g_(i)(k,n) refers to the intensity information of the frequency domain point;
    -   the horizontal angle μ and elevation angle η of the sound source acoustic image are calculated using the following formulas,

$\tan \mu(k,n) = \frac{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \cos \mu_{i} \cdot \cos \eta_{i}}{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \mu_{i} \cdot \cos \eta_{i}}$

$\tan \eta(k,n) = \frac{\sqrt{\left\lbrack \sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \cos \mu_{i} \cdot \cos \eta_{i} \right\rbrack^{2} + \left\lbrack \sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \mu_{i} \cdot \cos \eta_{i} \right\rbrack^{2}}}{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \eta_{i}}$

-   -   wherein N refers to the total number of speakers, i takes the values 1, 2, . . . , N, and μ(k, n) and η(k, n) are the horizontal angle μ and elevation angle η of the sound source acoustic image in the k-th frequency band of the n-th frame;
    -   the distance ρ from the sound source acoustic image to the origin of the spherical coordinate system is taken as the average of the distances from all the speakers to the listener;

step 2, determining the speaker beside the spatial position where the sound source acoustic image is located according to the determined spatial position (ρ, μ, η) of the sound source acoustic image;

step 3, calculating a correlation of signals of all sound tracks of the speakers selected at step 2 in the horizontal direction and the vertical direction, which is achieved by:

-   -   dividing the selected speakers into a left part and a right part according to the location of the acoustic image, using the vertical plane containing the connecting line between the sound source acoustic image and the listener as a projection plane, calculating the sums of the components of the left and right signals perpendicular to the projection plane, denoting the sums as P_(L) and P_(R) respectively, and calculating the correlation IC_(H) of the left and right signals as follows,

${IC}_{H} = \frac{{cov}\left( {P_{L},P_{R}} \right)}{\sqrt{{cov}\left( {P_{L},P_{L}} \right)} \cdot \sqrt{{cov}\left( {P_{R},P_{R}} \right)}}$

-   -   dividing the selected speakers into an upper part and a lower part according to the location of the acoustic image, using a plane where the sound source acoustic image and the listener are located as a projection plane, calculating the sums of the components of the upper and lower signals perpendicular to the projection plane, denoting the sums as P_(U) and P_(D) respectively, and calculating the correlation IC_(v) of the upper and lower signals as follows,

${IC}_{V} = \frac{{cov}\left( {P_{U},P_{D}} \right)}{\sqrt{{cov}\left( {P_{U},P_{U}} \right)} \cdot \sqrt{{cov}\left( {P_{D},P_{D}} \right)}}$

step 4, obtaining and storing a parameter set {IC_(H), IC_(v), Min{IC_(H), IC_(v)}} of the acoustic image body, wherein Min{IC_(H), IC_(v)} is the smaller of IC_(H) and IC_(v).

The present invention also provides a device of extracting a sound source acoustic image body in 3D space, the device comprises:

a spatial position extraction unit, configured to determine a spatial position of the sound source acoustic image by:

-   -   performing time-frequency conversion on the signal of each channel and applying the same sub-band division to each channel; and, with the listener as the origin of a spherical coordinate system, for a speaker with horizontal angle μ_(i) and elevation angle η_(i), setting a vector p_(i)(k,n) representing the time-frequency representation of the corresponding signal,

${p_{i}\left( {k,n} \right)} = {{g_{i}\left( {k,n} \right)} \cdot \begin{bmatrix} {\cos \; {\mu_{i} \cdot \cos}\; \eta_{i}} \\ {\sin \; {\mu_{i} \cdot \cos}\; \eta_{i}} \\ {\sin \; \eta_{i}} \end{bmatrix}}$

-   -   wherein i refers to the index of the speaker, k refers to the frequency band index, n refers to the time-domain frame index, and g_(i)(k,n) refers to the intensity information of the frequency domain point;
    -   the horizontal angle μ and elevation angle η of the sound source acoustic image are calculated using the following formulas,

$\tan \mu(k,n) = \frac{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \cos \mu_{i} \cdot \cos \eta_{i}}{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \mu_{i} \cdot \cos \eta_{i}}$

$\tan \eta(k,n) = \frac{\sqrt{\left\lbrack \sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \cos \mu_{i} \cdot \cos \eta_{i} \right\rbrack^{2} + \left\lbrack \sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \mu_{i} \cdot \cos \eta_{i} \right\rbrack^{2}}}{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \eta_{i}}$

-   -   wherein N refers to the total number of speakers, i takes the values 1, 2, . . . , N, and μ(k, n) and η(k, n) are the horizontal angle μ and elevation angle η of the sound source acoustic image in the k-th frequency band of the n-th frame;
    -   the distance ρ from the sound source acoustic image to the origin of the spherical coordinate system is taken as the average of the distances from all the speakers to the listener;

a speaker selecting unit, configured to determine the speaker beside the spatial position where the sound source acoustic image is located according to the determined spatial position (ρ, μ, η) of the sound source acoustic image;

a correlation extraction unit, configured to calculate a correlation of signals of all sound tracks of the speakers selected by the speaker selecting unit in the horizontal direction and the vertical direction, which is achieved by:

-   -   dividing the selected speakers into a left part and a right part according to the location of the acoustic image, using the vertical plane containing the connecting line between the sound source acoustic image and the listener as a projection plane, calculating the sums of the components of the left and right signals perpendicular to the projection plane, denoting the sums as P_(L) and P_(R) respectively, and calculating the correlation IC_(H) of the left and right signals as follows,

${IC}_{H} = \frac{{cov}\left( {P_{L},P_{R}} \right)}{\sqrt{{cov}\left( {P_{L},P_{L}} \right)} \cdot \sqrt{{cov}\left( {P_{R},P_{R}} \right)}}$

-   -   dividing the selected speakers into an upper part and a lower part according to the location of the acoustic image, using a plane where the sound source acoustic image and the listener are located as a projection plane, calculating the sums of the components of the upper and lower signals perpendicular to the projection plane, denoting the sums as P_(U) and P_(D) respectively, and calculating the correlation IC_(v) of the upper and lower signals as follows,

${IC}_{V} = \frac{{cov}\left( {P_{U},P_{D}} \right)}{\sqrt{{cov}\left( {P_{U},P_{U}} \right)} \cdot \sqrt{{cov}\left( {P_{D},P_{D}} \right)}}$

an acoustic image body characteristic storage unit, configured to obtain and store a parameter set {IC_(H), IC_(v), Min{IC_(H), IC_(v)}} of the acoustic image body, wherein Min{IC_(H), IC_(v)} is the smaller of IC_(H) and IC_(v).

The sound source acoustic image body refers to the sizes of the depth, length and height of the acoustic image in three dimensions relative to the listener. The present invention is directed to a multi-channel 3D audio system, and describes the size of the sound source acoustic image body by using correlations between different sound channels in three dimensions. The expression parameters of the acoustic image body obtained in the present invention are used for providing technical support for accurately restoring the size of the sound source acoustic image in a 3D audio live system, which solves the technical problem that the restored acoustic image in a 3D audio is excessively narrow at present.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows the calculation relationship between the speaker location and the signal in an embodiment of the present invention.

DETAILED DESCRIPTION

The present invention is further described below with reference to the drawings and the embodiments.

A person skilled in the art may use computer software technology to run the procedure of the technical solution of the present invention automatically. The procedure of the embodiment comprises:

step 1, determining a spatial position of a sound source acoustic image, wherein, with the listener as the origin of a spherical coordinate system, the spherical coordinates of a speaker can be set as (ρ, μ, η), where ρ is the distance from the speaker to the origin of the spherical coordinate system, μ is the horizontal angle and η is the elevation angle, as shown in FIG. 1.

With the listener as a reference point, orthogonal decomposition is performed on each channel signal in the multi-channel system to obtain the components of each sound channel on the X, Y and Z axes of a 3D Cartesian coordinate system. The component of each sound channel is the decomposition of the original mono source on that sound channel. Thus, after the components of each channel on the X, Y and Z axes are obtained, the components on each axis are summed, and the components of the original mono source with respect to the position of the listener are obtained. The embodiment is achieved by:

-   -   performing time-frequency conversion on the signal of each channel and applying the same sub-band division to each channel, wherein the time-frequency conversion and sub-band division are implemented through the prior art.
    -   As there are multiple speakers, the spherical coordinates (ρ, μ, η) of each speaker are denoted (ρ_(i), μ_(i), η_(i)), with the speaker index as the subscript. For the speaker with horizontal angle μ_(i) and elevation angle η_(i), a vector p_(i)(k,n) may be used to represent the time-frequency representation of the corresponding signal; p_(i)(k,n) is calculated as shown in formula (1):

$\begin{matrix} {{p_{i}\left( {k,n} \right)} = {{g_{i}\left( {k,n} \right)} \cdot \begin{bmatrix} {\cos \; {\mu_{i} \cdot \cos}\; \eta_{i}} \\ {\sin \; {\mu_{i} \cdot \cos}\; \eta_{i}} \\ {\sin \; \eta_{i}} \end{bmatrix}}} & (1) \end{matrix}$

-   -   wherein i refers to the index of the speaker, k refers to the frequency band index, n refers to the time-domain frame index, and g_(i)(k,n) refers to the intensity information of the frequency domain point. The direction of the sound source acoustic image can be decomposed into a horizontal angle μ and an elevation angle η, which can be calculated by formulas (2) and (3):

$\begin{matrix} {{\tan \; {\mu \left( {k,n} \right)}} = \frac{\sum\limits_{i = 1}^{N}\; {{{g_{i}\left( {k,n} \right)} \cdot \cos}\; {\mu_{i} \cdot \cos}\; \eta_{i}}}{\sum\limits_{i = 1}^{N}\; {{{g_{i}\left( {k,n} \right)} \cdot \sin}\; {\mu_{i} \cdot \cos}\; \eta_{i}}}} & (2) \\ {{\tan \; {\eta \left( {k,n} \right)}} = \frac{\sqrt{\begin{matrix} {\left\lbrack {\sum\limits_{i = 1}^{N}\; {{{g_{i}\left( {k,n} \right)} \cdot \cos}\; {\mu_{i} \cdot \cos}\; \eta_{i}}} \right\rbrack^{2} +} \\ \left\lbrack {\sum\limits_{i = 1}^{N}\; {{{g_{i}\left( {k,n} \right)} \cdot \sin}\; {\mu_{i} \cdot \cos}\; \eta_{i}}} \right\rbrack^{2} \end{matrix}}}{\sum\limits_{i = 1}^{N}\; {{{g_{i}\left( {k,n} \right)} \cdot \sin}\; \eta_{i}}}} & (3) \end{matrix}$

-   -   wherein N refers to the total number of speakers, i takes the values 1, 2, . . . , N, and μ(k, n) and η(k, n) are the horizontal angle μ and elevation angle η of the sound source acoustic image in the k-th frequency band of the n-th frame.
    -   The horizontal angle μ and elevation angle η of the sound source acoustic image are thus obtained. Because the speakers are distributed with the listener at the center, the distance ρ from the sound source acoustic image to the origin of the spherical coordinate system is taken as the average of the distances from all the speakers to the listener; typically, ρ = ρ_(1) = ρ_(2) = . . . = ρ_(N).
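The position estimate of step 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the embodiment's own code: the function names `speaker_vector` and `estimate_image_position` are hypothetical, angles are in radians, and the gains g_(i)(k,n) are taken as given for a single band k and frame n. The sketch recovers the angles from the summed vector of formula (1) using the usual spherical convention (tan μ = Y/X, tan η = Z/√(X²+Y²)). Formulas (2) and (3) as printed place the same sums in the opposite ratio, so the convention used here should be treated as an assumption.

```python
import math


def speaker_vector(gain, mu_i, eta_i):
    """Formula (1): the speaker's unit direction scaled by its intensity g_i(k, n)."""
    return (
        gain * math.cos(mu_i) * math.cos(eta_i),
        gain * math.sin(mu_i) * math.cos(eta_i),
        gain * math.sin(eta_i),
    )


def estimate_image_position(gains, speakers):
    """Estimate (rho, mu, eta) of the acoustic image for one band and frame.

    gains    -- g_i(k, n) for each speaker, in the same order as `speakers`
    speakers -- list of (rho_i, mu_i, eta_i) tuples, angles in radians
    """
    sx = sy = sz = 0.0
    for gain, (_, mu_i, eta_i) in zip(gains, speakers):
        x, y, z = speaker_vector(gain, mu_i, eta_i)
        sx, sy, sz = sx + x, sy + y, sz + z
    # Assumption: standard spherical convention for the summed vector.
    mu = math.atan2(sy, sx)
    eta = math.atan2(sz, math.hypot(sx, sy))
    # rho is the average speaker distance, as stated in the text.
    rho = sum(rho_i for rho_i, _, _ in speakers) / len(speakers)
    return rho, mu, eta
```

With a single active speaker the estimate coincides with that speaker's own direction, which is a quick sanity check on the convention chosen.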

step 2, determining the speaker beside the spatial position where the sound source acoustic image is located.

After the spatial position (ρ, μ, η) for restoring the sound source acoustic image is determined, the speakers beside the sound source acoustic image are found according to the position of the sound source acoustic image.

In a specific implementation, the speakers are sorted from nearest to farthest according to the distance from each speaker (ρ_(i), μ_(i), η_(i)) to the sound source acoustic image, and the nearest speakers are selected. The number of speakers is chosen flexibly according to the actual situation; selecting 4 to 8 speakers is generally advisable.
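The selection in step 2 can be sketched as a sort by distance. The text does not say which distance measure is used, so the sketch below assumes the Euclidean distance between Cartesian positions. The helper names `to_cartesian` and `select_nearest_speakers` and the default `count=4` are illustrative choices, not part of the described method.

```python
import math


def to_cartesian(rho, mu, eta):
    """Convert spherical (rho, horizontal angle, elevation) to (x, y, z)."""
    return (
        rho * math.cos(mu) * math.cos(eta),
        rho * math.sin(mu) * math.cos(eta),
        rho * math.sin(eta),
    )


def select_nearest_speakers(image, speakers, count=4):
    """Return the indices of the `count` speakers closest to the image.

    image    -- (rho, mu, eta) of the sound source acoustic image
    speakers -- list of (rho_i, mu_i, eta_i), angles in radians
    """
    image_xyz = to_cartesian(*image)
    # Assumption: Euclidean distance is used to rank the speakers.
    distances = [math.dist(image_xyz, to_cartesian(*s)) for s in speakers]
    order = sorted(range(len(speakers)), key=distances.__getitem__)
    return order[:count]
```

Returning indices rather than coordinates keeps the link to the per-channel signals, which are needed in step 3.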

step 3, calculating a correlation of signals of all sound tracks of the speakers selected at step 2 in the horizontal direction and the vertical direction, wherein the correlation indicates the size of acoustic image in the horizontal and vertical directions.

-   -   the selected speakers are divided into a left part and a right part according to the location of the acoustic image. With P_(i) set as the frequency domain value of the i-th channel of the sound source, and using the vertical plane containing the connecting line between the sound source acoustic image and the listener as a projection plane, the sums of the components of the left and right signals perpendicular to the projection plane are calculated and denoted as P_(L) and P_(R) respectively. That is, for all speakers selected at step 2 on the left side of the acoustic image, the component of the frequency domain value P_(i) of each speaker perpendicular to the projection plane is obtained, and these components are summed to obtain P_(L); for all speakers selected at step 2 on the right side of the acoustic image, the component of the frequency domain value P_(i) of each speaker perpendicular to the projection plane is obtained, and these components are summed to obtain P_(R). The correlation IC_(H) of the left and right signals is then calculated as shown in formula (4):

$\begin{matrix} {{IC}_{H} = \frac{{cov}\left( {P_{L},P_{R}} \right)}{\sqrt{{cov}\left( {P_{L},P_{L}} \right)} \cdot \sqrt{{cov}\left( {P_{R},P_{R}} \right)}}} & (4) \end{matrix}$
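Formula (4) is a normalized covariance, and a small sketch may make the computation concrete. The text does not state over which samples the covariance is taken, so the code below assumes two equal-length sequences of real values, for example the sums P_(L) and P_(R) collected over successive frames. The names `covariance` and `correlation` are hypothetical, and returning 0.0 for a constant input is a choice made here to avoid division by zero.

```python
import math


def covariance(a, b):
    """Population covariance of two equal-length real sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def correlation(p_first, p_second):
    """Normalized covariance in the form of formula (4)."""
    denom = math.sqrt(covariance(p_first, p_first)) * math.sqrt(
        covariance(p_second, p_second)
    )
    if denom == 0.0:
        # Assumption: a constant signal is treated as uncorrelated.
        return 0.0
    return covariance(p_first, p_second) / denom
```

The same function applies unchanged to the upper and lower sums when computing the vertical correlation.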

Similarly, the selected speakers are divided into an upper part and a lower part according to the location of the acoustic image. Using as a projection plane the plane in which the sound source acoustic image and the listener are located and which is perpendicular to the vertical plane mentioned above, the sums of the components of the upper and lower signals perpendicular to the projection plane are calculated and denoted as P_(U) and P_(D) respectively. That is, for all speakers selected at step 2 on the upper side of the acoustic image, the component of the frequency domain value P_(i) of each speaker perpendicular to the projection plane is obtained, and these components are summed to obtain P_(U); for all speakers selected at step 2 on the lower side of the acoustic image, the component of the frequency domain value P_(i) of each speaker perpendicular to the projection plane is obtained, and these components are summed to obtain P_(D). The correlation IC_(v) of the upper and lower signals is then calculated as shown in formula (5):

$\begin{matrix} {{IC}_{V} = \frac{{cov}\left( {P_{U},P_{D}} \right)}{\sqrt{{cov}\left( {P_{U},P_{U}} \right)} \cdot \sqrt{{cov}\left( {P_{D},P_{D}} \right)}}} & (5) \end{matrix}$
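The left/right and upper/lower splits described above can be sketched geometrically. The sketch below is one possible reading, with several assumptions. Each channel is represented by its formula (1) vector for one band and frame. The horizontal angle is assumed to increase counterclockwise, so a positive component along the normal of the vertical plane means "left". Each side's sum is accumulated as a non-negative magnitude. The function name `side_sums` and the sign conventions are illustrative only.

```python
import math


def side_sums(image_mu, image_eta, speakers, gains):
    """Return (P_L, P_R, P_U, P_D) for one band and frame.

    speakers -- list of (mu_i, eta_i) for the selected speakers, in radians
    gains    -- g_i(k, n) for the same speakers
    """
    # Normal of the vertical plane through the listener and the image
    # (assumed to point to the left of the image direction).
    n_h = (-math.sin(image_mu), math.cos(image_mu), 0.0)
    # Normal of the plane that contains the listener-image line and is
    # perpendicular to that vertical plane (points upward).
    n_v = (
        -math.cos(image_mu) * math.sin(image_eta),
        -math.sin(image_mu) * math.sin(image_eta),
        math.cos(image_eta),
    )
    p_l = p_r = p_u = p_d = 0.0
    for gain, (mu_i, eta_i) in zip(gains, speakers):
        p = (
            gain * math.cos(mu_i) * math.cos(eta_i),
            gain * math.sin(mu_i) * math.cos(eta_i),
            gain * math.sin(eta_i),
        )
        h = sum(a * b for a, b in zip(p, n_h))  # component normal to vertical plane
        v = sum(a * b for a, b in zip(p, n_v))  # component normal to the second plane
        if h > 0:
            p_l += h
        elif h < 0:
            p_r += -h
        if v > 0:
            p_u += v
        elif v < 0:
            p_d += -v
    return p_l, p_r, p_u, p_d
```

Collecting these four sums over a series of frames gives the sequences from which the correlations of formulas (4) and (5) can be computed.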

Thus parameters indicative of the size of the acoustic image in the horizontal and vertical directions are obtained. Because human perception of distance is not very sensitive, the distance parameter may be represented by the smaller of IC_(H) and IC_(v), namely Min{IC_(H), IC_(v)}.

With the above method, the acoustic image body of each frequency band of each frame is obtained from the horizontal angle μ and elevation angle η of that band and frame.

In a specific implementation, the extracted acoustic image body may be represented by a parameter set {IC_(H), IC_(v), Min{IC_(H), IC_(v)}} and may be stored, so as to restore the sound source acoustic image.
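As a small illustration of how the parameter set might be held per band and frame, the sketch below uses a Python dataclass. The class name `AcousticImageBody`, the helper `store_parameters` and the choice of a dictionary keyed by (band, frame) are assumptions made for this example; the text only specifies that the set {IC_(H), IC_(v), Min{IC_(H), IC_(v)}} is stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcousticImageBody:
    """Correlation parameters describing the acoustic image body."""

    ic_h: float  # horizontal (left/right) correlation
    ic_v: float  # vertical (upper/lower) correlation

    @property
    def ic_min(self):
        """Min{IC_H, IC_V}, used as the distance parameter."""
        return min(self.ic_h, self.ic_v)

    def as_parameter_set(self):
        """Return (IC_H, IC_V, Min{IC_H, IC_V})."""
        return (self.ic_h, self.ic_v, self.ic_min)


def store_parameters(table, band, frame, ic_h, ic_v):
    """Record the parameter set for band k and frame n in `table`."""
    table[(band, frame)] = AcousticImageBody(ic_h, ic_v).as_parameter_set()
    return table
```

Deriving the minimum from the two stored correlations, rather than storing it independently, keeps the three values consistent by construction.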

The technical solution of the present invention may be implemented as a device using modular software technology. The embodiment of the present invention accordingly provides a device of extracting a sound source acoustic image body in 3D space, the device comprising:

a spatial position extraction unit, configured to determine a spatial position of the sound source acoustic image by:

-   -   performing time-frequency conversion on the signal of each channel and applying the same sub-band division to each channel; and, with the listener as the origin of a spherical coordinate system, for a speaker with horizontal angle μ_(i) and elevation angle η_(i), setting a vector p_(i)(k,n) representing the time-frequency representation of the corresponding signal,

${p_{i}\left( {k,n} \right)} = {{g_{i}\left( {k,n} \right)} \cdot \begin{bmatrix} {\cos \; {\mu_{i} \cdot \cos}\; \eta_{i}} \\ {\sin \; {\mu_{i} \cdot \cos}\; \eta_{i}} \\ {\sin \; \eta_{i}} \end{bmatrix}}$

-   -   wherein i refers to the index of the speaker, k refers to the frequency band index, n refers to the time-domain frame index, and g_(i)(k,n) refers to the intensity information of the frequency domain point;
    -   the horizontal angle μ and elevation angle η of the sound source acoustic image are calculated using the following formulas,

$\tan \mu(k,n) = \frac{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \cos \mu_{i} \cdot \cos \eta_{i}}{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \mu_{i} \cdot \cos \eta_{i}}$

$\tan \eta(k,n) = \frac{\sqrt{\left\lbrack \sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \cos \mu_{i} \cdot \cos \eta_{i} \right\rbrack^{2} + \left\lbrack \sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \mu_{i} \cdot \cos \eta_{i} \right\rbrack^{2}}}{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \eta_{i}}$

-   -   wherein N refers to the total number of speakers, i takes the values 1, 2, . . . , N, and μ(k, n) and η(k, n) are the horizontal angle μ and elevation angle η of the sound source acoustic image in the k-th frequency band of the n-th frame;
    -   the distance ρ from the sound source acoustic image to the origin of the spherical coordinate system is taken as the average of the distances from all the speakers to the listener;

a speaker selecting unit, configured to determine the speaker beside the spatial position where the sound source acoustic image is located according to the determined spatial position (ρ, μ, η) of the sound source acoustic image;

a correlation extraction unit, configured to calculate a correlation of signals of all sound tracks of the speakers selected by the speaker selecting unit in the horizontal direction and the vertical direction, which is achieved by:

-   -   dividing the selected speakers into a left part and a right part according to the location of the acoustic image, using the vertical plane containing the connecting line between the sound source acoustic image and the listener as a projection plane, calculating the sums of the components of the left and right signals perpendicular to the projection plane, denoting the sums as P_(L) and P_(R) respectively, and calculating the correlation IC_(H) of the left and right signals as follows,

${IC}_{H} = \frac{{cov}\left( {P_{L},P_{R}} \right)}{\sqrt{{cov}\left( {P_{L},P_{L}} \right)} \cdot \sqrt{{cov}\left( {P_{R},P_{R}} \right)}}$

-   -   dividing the selected speakers into an upper part and a lower part according to the location of the acoustic image, using a plane where the sound source acoustic image and the listener are located as a projection plane, calculating the sums of the components of the upper and lower signals perpendicular to the projection plane, denoting the sums as P_(U) and P_(D) respectively, and calculating the correlation IC_(v) of the upper and lower signals as follows,

${IC}_{V} = \frac{{cov}\left( {P_{U},P_{D}} \right)}{\sqrt{{cov}\left( {P_{U},P_{U}} \right)} \cdot \sqrt{{cov}\left( {P_{D},P_{D}} \right)}}$

an acoustic image body characteristic storage unit, configured to obtain and store a parameter set {IC_(H), IC_(v), Min{IC_(H), IC_(v)}} of the acoustic image body, wherein Min{IC_(H), IC_(v)} is the smaller of IC_(H) and IC_(v). IC_(H), IC_(v) and Min{IC_(H), IC_(v)} are used to characterize the length, height and depth of the acoustic image in three dimensions, respectively.

The above-described examples merely illustrate the implementation of the method of the present invention. Any changes or alterations that a person skilled in the art can readily conceive within the technical scope disclosed herein shall fall within the scope of the invention, and the scope of protection is defined by the appended claims.

What is claimed is:
 1. A method of extracting a sound source acoustic image body in 3D space, the method comprising:

step 1, determining a spatial position of a sound source acoustic image, which is achieved by: performing time-frequency conversion on the signal of each channel and applying the same sub-band division to each channel; and, with the listener as the origin of a spherical coordinate system, for a speaker with horizontal angle μ_(i) and elevation angle η_(i), setting a vector p_(i)(k,n) representing the time-frequency representation of the corresponding signal,

$p_{i}(k,n) = g_{i}(k,n) \cdot \begin{bmatrix} \cos \mu_{i} \cdot \cos \eta_{i} \\ \sin \mu_{i} \cdot \cos \eta_{i} \\ \sin \eta_{i} \end{bmatrix}$

wherein i refers to the index of the speaker, k refers to the frequency band index, n refers to the time-domain frame index, and g_(i)(k,n) refers to the intensity information of the frequency domain point; the horizontal angle μ and elevation angle η of the sound source acoustic image are calculated using the following formulas,

$\tan \mu(k,n) = \frac{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \cos \mu_{i} \cdot \cos \eta_{i}}{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \mu_{i} \cdot \cos \eta_{i}}$

$\tan \eta(k,n) = \frac{\sqrt{\left\lbrack \sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \cos \mu_{i} \cdot \cos \eta_{i} \right\rbrack^{2} + \left\lbrack \sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \mu_{i} \cdot \cos \eta_{i} \right\rbrack^{2}}}{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \eta_{i}}$

wherein N refers to the total number of speakers, i takes the values 1, 2, . . . , N, and μ(k, n) and η(k, n) are the horizontal angle μ and elevation angle η of the sound source acoustic image in the k-th frequency band of the n-th frame; the distance ρ from the sound source acoustic image to the origin of the spherical coordinate system is taken as the average of the distances from all the speakers to the listener;

step 2, determining the speaker beside the spatial position where the sound source acoustic image is located according to the determined spatial position (ρ, μ, η) of the sound source acoustic image;

step 3, calculating a correlation of signals of all sound tracks of the speakers selected at step 2 in the horizontal direction and the vertical direction, which is achieved by: dividing the selected speakers into a left part and a right part according to the location of the acoustic image, using the vertical plane containing the connecting line between the sound source acoustic image and the listener as a projection plane, calculating the sums of the components of the left and right signals perpendicular to the projection plane, denoting the sums as P_(L) and P_(R) respectively, and calculating the correlation IC_(H) of the left and right signals as follows,

${IC}_{H} = \frac{{cov}\left( P_{L},P_{R} \right)}{\sqrt{{cov}\left( P_{L},P_{L} \right)} \cdot \sqrt{{cov}\left( P_{R},P_{R} \right)}}$

dividing the selected speakers into an upper part and a lower part according to the location of the acoustic image, using a plane where the sound source acoustic image and the listener are located as a projection plane, calculating the sums of the components of the upper and lower signals perpendicular to the projection plane, denoting the sums as P_(U) and P_(D) respectively, and calculating the correlation IC_(v) of the upper and lower signals as follows,

${IC}_{V} = \frac{{cov}\left( P_{U},P_{D} \right)}{\sqrt{{cov}\left( P_{U},P_{U} \right)} \cdot \sqrt{{cov}\left( P_{D},P_{D} \right)}}$

step 4, obtaining and storing a parameter set {IC_(H), IC_(v), Min{IC_(H), IC_(v)}} of the acoustic image body, wherein Min{IC_(H), IC_(v)} is the smaller of IC_(H) and IC_(v).
 2. A device of extracting a sound source acoustic image body in 3D space, the device comprising:

a spatial position extraction unit, configured to determine a spatial position of the sound source acoustic image by: performing time-frequency conversion on the signal of each channel and applying the same sub-band division to each channel; and, with the listener as the origin of a spherical coordinate system, for a speaker with horizontal angle μ_(i) and elevation angle η_(i), setting a vector p_(i)(k,n) representing the time-frequency representation of the corresponding signal,

$p_{i}(k,n) = g_{i}(k,n) \cdot \begin{bmatrix} \cos \mu_{i} \cdot \cos \eta_{i} \\ \sin \mu_{i} \cdot \cos \eta_{i} \\ \sin \eta_{i} \end{bmatrix}$

wherein i refers to the index of the speaker, k refers to the frequency band index, n refers to the time-domain frame index, and g_(i)(k,n) refers to the intensity information of the frequency domain point; the horizontal angle μ and elevation angle η of the sound source acoustic image are calculated using the following formulas,

$\tan \mu(k,n) = \frac{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \cos \mu_{i} \cdot \cos \eta_{i}}{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \mu_{i} \cdot \cos \eta_{i}}$

$\tan \eta(k,n) = \frac{\sqrt{\left\lbrack \sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \cos \mu_{i} \cdot \cos \eta_{i} \right\rbrack^{2} + \left\lbrack \sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \mu_{i} \cdot \cos \eta_{i} \right\rbrack^{2}}}{\sum\limits_{i = 1}^{N} g_{i}(k,n) \cdot \sin \eta_{i}}$

wherein N refers to the total number of speakers, i takes the values 1, 2, . . . , N, and μ(k, n) and η(k, n) are the horizontal angle μ and elevation angle η of the sound source acoustic image in the k-th frequency band of the n-th frame; the distance ρ from the sound source acoustic image to the origin of the spherical coordinate system is taken as the average of the distances from all the speakers to the listener;

a speaker selecting unit, configured to determine the speaker beside the spatial position where the sound source acoustic image is located according to the determined spatial position (ρ, μ, η) of the sound source acoustic image;

a correlation extraction unit, configured to calculate a correlation of signals of all sound tracks of the speakers selected by the speaker selecting unit in the horizontal direction and the vertical direction, which is achieved by: dividing the selected speakers into a left part and a right part according to the location of the acoustic image, using the vertical plane containing the connecting line between the sound source acoustic image and the listener as a projection plane, calculating the sums of the components of the left and right signals perpendicular to the projection plane, denoting the sums as P_(L) and P_(R) respectively, and calculating the correlation IC_(H) of the left and right signals as follows,

${IC}_{H} = \frac{{cov}\left( P_{L},P_{R} \right)}{\sqrt{{cov}\left( P_{L},P_{L} \right)} \cdot \sqrt{{cov}\left( P_{R},P_{R} \right)}}$

dividing the selected speakers into an upper part and a lower part according to the location of the acoustic image, using a plane where the sound source acoustic image and the listener are located as a projection plane, calculating the sums of the components of the upper and lower signals perpendicular to the projection plane, denoting the sums as P_(U) and P_(D) respectively, and calculating the correlation IC_(v) of the upper and lower signals as follows,

${IC}_{V} = \frac{{cov}\left( P_{U},P_{D} \right)}{\sqrt{{cov}\left( P_{U},P_{U} \right)} \cdot \sqrt{{cov}\left( P_{D},P_{D} \right)}}$

an acoustic image body characteristic storage unit, configured to obtain and store a parameter set {IC_(H), IC_(v), Min{IC_(H), IC_(v)}} of the acoustic image body, wherein Min{IC_(H), IC_(v)} is the smaller of IC_(H) and IC_(v).